Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Author

Luo Yuming

Published

October 24, 2024

Modified

November 15, 2024


1. Overview

This exercise focuses on predicting HDB resale prices in Singapore using geographically weighted machine learning methods. By incorporating spatial components, we aim to understand regional patterns and improve prediction accuracy.


2. Objectives

The goals of this exercise are to:

  1. Apply geographically weighted machine learning models to predict HDB resale prices.
  2. Assess the performance of different models, including random forest and geographically weighted random forest (GWRF).
  3. Visualize and interpret spatial variations in predicted prices.

3. Getting Started

3.1 Required Packages

In this exercise, we will use the following packages:

Package Description
sf Provides functions for reading, processing, and visualizing spatial data in the “Simple Features” format, enabling spatial data handling in R.
spdep Provides tools for spatial dependency modeling, including spatial weights and measures for spatial autocorrelation, such as Moran’s I, useful for detecting spatial patterns.
tidyverse A suite of R packages designed for data manipulation (dplyr, tidyr), visualization (ggplot2), and other common data science tasks, improving data handling and analysis.
tmap A flexible package for creating static and interactive maps, allowing cartographic-quality visualizations of spatial data.
GWmodel Contains functions for Geographically Weighted Regression (GWR) and other spatially weighted models, allowing local modeling of spatial data where relationships can vary across geographic space.
caret A comprehensive package for machine learning, providing tools for model training, tuning, and evaluation, supporting methods like cross-validation and hyperparameter tuning.
ranger An efficient implementation of the random forest algorithm optimized for large datasets, used for predictive modeling and capable of handling both classification and regression tasks.
httr Facilitates HTTP requests in R, useful for connecting to APIs like OneMap to retrieve geographic coordinates based on addresses.
jsonlite A package for working with JSON data in R, enabling easy conversion of JSON data from web APIs into R data frames.
Code
# Load necessary packages
pacman::p_load(tidyverse, sf, spdep, GWmodel, tmap, caret, ranger, httr, jsonlite, ggplot2, ggpubr, ggstatsplot, units, olsrr, gtsummary, SpatialML,Metrics)

3.2 The Data

Dataset Name Description Format Source
Master Plan 2019 Subzone Geospatial data representing the boundaries of different regions in Singapore. Shapefile Singapore Land Authority
HDB Resale Transactions Aspatial data containing information on HDB resale prices, house type and location. CSV Singapore Open Data Portal
Various Facilities Data Includes data on the location of elderly care centres, supermarkets, hawker centres, parks, etc. for analysing the impact of proximity on price. GeoJSON Various open sources in Singapore
MRT/Bus Stations Data on the location of MRT and Bus stations in Singapore is used to calculate the impact of commuting convenience on house prices. Shapefile Singapore Land Transport Authority

4. Data Preprocessing

4.1 Loading and Filtering Resale Data

4.1.1 Geospatial Data

Code
mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL") %>%
  st_transform(3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
Code
bus_stops <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21
Code
train_stations <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
  st_transform(crs = 3414) %>%
  filter(grepl("STATION|MRT", STN_NAM_DE))
Reading layer `RapidTransitSystemStation' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 231 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
Note

This step filters the train station dataset to retain only actual station locations, removing entries such as depots or non-station facilities. We use grepl("STATION|MRT", STN_NAM_DE) to select records where STN_NAM_DE contains “STATION”or”MRT”, ensuring that only transit stations are included for accurate spatial analysis.

Code
hawker_center <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `HawkerCentresGEOJSON' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/HawkerCentresGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Code
clinics <- st_read("data/geospatial/CHASClinics.geojson") %>%
  st_transform(crs = 3414)
Reading layer `CHASClinics' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/CHASClinics.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1193 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Code
supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `SupermarketsGEOJSON' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/SupermarketsGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Code
schoolzone <- st_read("data/geospatial/LTASchoolZone.geojson") %>%
  st_transform(crs = 3414)
Reading layer `LTASchoolZone' from data source 
  `/Users/lucasluo/Desktop/SMU/Courses/Term3 Aug-Dec/ISSS626-Geospatial Analytics and Applications/lucasluo6/ISSS626/Take-home_Ex/Take-home_Ex03/data/geospatial/LTASchoolZone.geojson' 
  using driver `GeoJSON'
Simple feature collection with 211 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.687 ymin: 1.272736 xmax: 103.9668 ymax: 1.457587
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.1.2 Aspatial Data

Code
resale <- read_csv("data/aspatial/resale.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Note

Note loads the resale data and filters it to only include transactions from January 2023 to September 2024. glimpse() provides an overview of the data structure and variables.

Code
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

4.2 Z-Dimensions Removal

We’ll use the st_zm() function to remove the z-dimension from the data.

Code
hawker_center <- st_zm(hawker_center)
clinics <- st_zm(clinics)
supermarkets <- st_zm(supermarkets)
schoolzone<- st_zm(schoolzone)

4.3 Duel with invalid geometries

Code
# Check for invalid geometries
datasets <- list(mpsz = mpsz, bus_stops = bus_stops, train_stations = train_stations, 
                 hawker_center = hawker_center, clinics = clinics, 
                 supermarkets = supermarkets, schoolzone = schoolzone)

for (name in names(datasets)) {
  invalid_count <- sum(!st_is_valid(datasets[[name]]))
  cat("Number of invalid geometries in", name, ":", invalid_count, "\n")
}
Number of invalid geometries in mpsz : 9 
Number of invalid geometries in bus_stops : 0 
Number of invalid geometries in train_stations : 2 
Number of invalid geometries in hawker_center : 0 
Number of invalid geometries in clinics : 0 
Number of invalid geometries in supermarkets : 0 
Number of invalid geometries in schoolzone : 0 
Code
# Repair invalid geometries
for (name in names(datasets)) {
  if (sum(!st_is_valid(datasets[[name]])) > 0) {
    datasets[[name]] <- st_make_valid(datasets[[name]])
    cat("Repaired invalid geometries in", name, "\n")
  }
  
  # Check again to confirm all geometries are now valid
  invalid_count <- sum(!st_is_valid(datasets[[name]]))
  cat("After repair, number of invalid geometries in", name, ":", invalid_count, "\n")
}
Repaired invalid geometries in mpsz 
After repair, number of invalid geometries in mpsz : 0 
After repair, number of invalid geometries in bus_stops : 0 
Repaired invalid geometries in train_stations 
After repair, number of invalid geometries in train_stations : 0 
After repair, number of invalid geometries in hawker_center : 0 
After repair, number of invalid geometries in clinics : 0 
After repair, number of invalid geometries in supermarkets : 0 
After repair, number of invalid geometries in schoolzone : 0 

4.4 Data Transformation

Code
resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))
Note

Note This transformation combines block and street name into a single address field and extracts remaining_lease_yr and remaining_lease_mth for further analysis.

Code
resale_selected <- resale_tidy %>%
  filter(month == "2024-09")
Code
add_list <- sort(unique(resale_selected$address))
Note

Note filters the dataset to only include September 2024 transactions and extracts unique addresses to be geocoded.

4.5 Saving the Geocoded Coordinates

Code
coords <- get_coords(add_list)
Note

Note This function get_coords uses the OneMap API to retrieve coordinates (latitude and longitude) for each unique address in add_list. It handles cases where multiple or no results are returned for an address.

Code
write_rds(coords, "data/rds/coords.rds")
Note

Note The geocoded coordinates are saved to an RDS file to avoid re-running the API calls, making future analyses more efficient.

4.6 Data Wrangling and Joining Coordinates

Code
# Join coordinates with filtered resale data
resale_geo <- resale_selected %>%
  left_join(coords, by = "address") %>%
  filter(!is.na(latitude) & !is.na(longitude))%>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
Note

Note This code joins the geocoded coordinates with the filtered resale data, creating a geospatial dataset with latitude and longitude fields for further spatial analysis.

4.7 Data Data Visualization

Code
# Enable automatic fixing of invalid polygons
tmap_options(check.and.fix = TRUE)

# Switch to interactive view mode
tmap_mode("view")

# Interactive map with mpsz as base
tm_shape(mpsz) +
  tm_polygons(col = "white", border.col = "grey") +
  
  # Add layers interactively
  tm_shape(resale_geo) +
  tm_symbols(col = "yellow", size = 0.1, shape = 19) +
   
  tm_shape(clinics) +
  tm_symbols(col = "blue", size = 0.1, shape = 19) +
  
  tm_shape(hawker_center) +
  tm_symbols(col = "red", size = 0.15, shape = 21) +
  
  tm_shape(train_stations) +
  tm_symbols(col = "green", size = 0.1, shape = 19) +
  
  tm_shape(bus_stops) +
  tm_symbols(col = "grey", size = 0.05, shape = 20) +

  tm_shape(supermarkets) +
  tm_symbols(col = "brown", size = 0.1, shape = 19) +
  
  tm_layout(title = "Interactive Map of Amenities in Singapore")
Note

This visualization displays various amenities across Singapore, including healthcare facilities, childcare centers, eldercare centers, public transit stations, parks, and retail stores. Each type of location is represented with a unique color and symbol to make it easy to distinguish. This map can help analyze the spatial distribution of these amenities and their accessibility to different residential areas.

4.7.1 Filtering Data to Singapore Boundaries

Note Some bus stops were located outside Singapore’s boundary in the initial visualization. To ensure that only relevant data within Singapore is included, we filter the bus_stops dataset to retain only those points that fall within the geographical boundaries defined by the mpsz dataset. We achieve this by performing a spatial intersection between the bus stop points and the Singapore boundary (mpsz). This will help create a more accurate visualization that excludes any extraneous data outside Singapore.

Code
# Ensure bus_stops are within the mpsz boundary
bus_stops <- st_intersection(bus_stops, mpsz)
clinics <- st_intersection(clinics, mpsz)
Code
# Visualize the filtered data to confirm that all amenities are within Singapore
tmap_mode("view")

tm_shape(mpsz) +
  tm_polygons(col = "white", border.col = "grey") +
  tm_shape(resale_geo) +
  tm_symbols(col = "yellow", size = 0.1, shape = 19) +
  tm_shape(clinics) +
  tm_symbols(col = "blue", size = 0.1, shape = 19) +
  tm_shape(hawker_center) +
  tm_symbols(col = "red", size = 0.15, shape = 21) +
  tm_shape(train_stations) +
  tm_symbols(col = "green", size = 0.1, shape = 19) +
  tm_shape(bus_stops) +  # Filtered bus stops
  tm_symbols(col = "grey", size = 0.05, shape = 20) +
  tm_shape(supermarkets) +
  tm_symbols(col = "brown", size = 0.1, shape = 19) +
  tm_layout(title = "Filtered Interactive Map of Amenities in Singapore")

5. Data Wrangling

5.1. Structural Factors

Flat Type:

Code
# Filter `resale_geo` into separate data frames based on flat type
resale_geo_3room <- resale_geo %>% filter(flat_type == "3 ROOM")
resale_geo_4room <- resale_geo %>% filter(flat_type == "4 ROOM")
resale_geo_5room <- resale_geo %>% filter(flat_type == "5 ROOM")

Area of the Unit: Check if you have a floor_area_sqm column.

Code
# Convert area to numeric if necessary and handle missing values
resale_geo_4room <- resale_geo_4room %>%
  mutate(floor_area_sqm = as.numeric(floor_area_sqm)) %>%
  drop_na(floor_area_sqm)

Floor Level: If floor information is represented in ranges (e.g., “01 TO 03”), convert it to an approximate floor level (e.g., the median floor of the range). Create a new column, floor_level, to store the converted floor level.

Code
# Convert `storey_range` to approximate floor level
resale_geo_4room  <- resale_geo_4room  %>%
  mutate(floor_level = (as.numeric(str_extract(storey_range, "^[0-9]+")) +
                     as.numeric(str_extract(storey_range, "[0-9]+$"))) / 2)

Age of the Unit:

Code
# Calculate age of the unit
current_year <- as.numeric(format(Sys.Date(), "%Y"))

resale_geo_4room <- resale_geo_4room %>%
  mutate(age = current_year - lease_commence_date)

5.2 Locational Factors

Proximity to Key Amenities Calculate the distance from each HDB flat to various key amenities like MRT, parks, and shopping malls. Use st_distance to find the nearest amenity in each category.

Code
# Function to calculate proximity to the nearest amenity with units in meters
calculate_nearest <- function(resale_data, amenity_data, amenity_name) {
  # Calculate distances and get the minimum for each resale point
  nearest_distances <- st_distance(resale_data, amenity_data) %>%
    apply(1, min)
  
  # Set units to meters if not already
  nearest_distances <- set_units(nearest_distances, "m", mode = "standard")
  
  # Add the distances as a new column in the resale data
  resale_data[[amenity_name]] <- nearest_distances
  return(resale_data)
}

# Calculate distances to amenities for resale_geo_4room
resale_geo_4room <- calculate_nearest(resale_geo_4room, train_stations, "dist_to_mrt")
resale_geo_4room <- calculate_nearest(resale_geo_4room, bus_stops, "dist_to_bus")
resale_geo_4room <- calculate_nearest(resale_geo_4room, supermarkets, "dist_to_supermarket")

Count of Facilities within Radius Use the calculate_amenity_count function to calculate the number of facilities within a specific radius of each flat. For each amenity type, specify a radius and create a new column indicating the count of nearby facilities.

Code
# Function to calculate the count of amenities within a specified radius
calculate_amenity_count <- function(resale_data, amenity_data, radius, amenity_name) {
  # Calculate if each resale point has any amenities within the radius
  nearby_count <- st_is_within_distance(resale_data, amenity_data, dist = radius) %>% lengths()
  
  # Print summary for debugging
  print(paste("Summary for", amenity_name, "within", radius, "meters:"))
  print(summary(nearby_count))
  
  # Add count as a new column in resale_data
  resale_data <- resale_data %>%
    mutate(!!paste0("count_", amenity_name) := nearby_count)
  
  return(resale_data)
}

# Calculate counts for each amenity type within a 500-meter radius
resale_geo_4room <- calculate_amenity_count(resale_geo_4room, clinics, 500, "clinics")
[1] "Summary for clinics within 500 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   4.000   5.000   6.012   8.000  21.000 
Code
resale_geo_4room <- calculate_amenity_count(resale_geo_4room, hawker_center, 500, "hawker_center")
[1] "Summary for hawker_center within 500 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.5335  1.0000  4.0000 
Code
resale_geo_4room<- calculate_amenity_count(resale_geo_4room, schoolzone, 300, "schoolzone")
[1] "Summary for schoolzone within 300 meters:"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.6847  1.0000  3.0000 

Filtering out unnecessary columns and saving the final dataset to a file is a good practice. This will help streamline analysis and avoid redundant calculations.

Code
# Rename columns for resale_geo_4room (similar steps can be applied to other datasets)
resale_new_4room <- resale_geo_4room %>%
  rename(
    MONTH = month,
    TOWN = town,
    FLOOR_AREA_SQM = floor_area_sqm,
    ADDRESS = address,
    FLAT_MODEL = flat_model,
    RESALE_PRICE = resale_price,
    FLOOR_LEVEL = floor_level,
    REMAINING_LEASE_YEARS = remaining_lease_yr,
    UNIT_AGE = age,
    PROX_MRT = dist_to_mrt,
    PROX_BUS_STOP = dist_to_bus,
    PROX_SUPERMARKET = dist_to_supermarket
  ) %>%
  select(
    MONTH, TOWN, FLOOR_AREA_SQM, ADDRESS, FLAT_MODEL, RESALE_PRICE, FLOOR_LEVEL, REMAINING_LEASE_YEARS, UNIT_AGE,
    PROX_MRT, PROX_BUS_STOP, PROX_SUPERMARKET, count_clinics, count_hawker_center, count_schoolzone
  )

6.Exploratory Data Analysis (EDA)

In this section, we’ll use statistical graphics functions from packages like ggplot2 and tmap to conduct exploratory data analysis (EDA) of HDB resale data. This analysis will help us understand the feature distributions and spatial patterns of the data.

6.1 Analyzing HDB Resale Price Distribution

We start by plotting the distribution of resale prices to observe skewness or outliers in the data. The code below creates an initial histogram of resale price distribution:

Code
# Histogram showing HDB resale price distribution
ggplot(data = resale_new_4room, aes(x = RESALE_PRICE)) +
  geom_histogram(bins = 20, color = "black", fill = "light blue") +
  labs(title = "Distribution of HDB Resale Prices", x = "Resale Price", y = "Frequency")

Note

Prominent Peak in Mid-Range Prices: The highest concentration of resale transactions occurs between SGD 450,000 and SGD 600,000, with the peak frequency around SGD 500,000. This suggests that mid-range flats are most common and potentially more accessible to buyers.

Reduced High-End Sales: There are fewer transactions as prices increase beyond SGD 700,000, with even fewer above SGD 1 million. These higher resale prices likely correspond to larger or newer flats, or flats in more premium locations.

Absence of Very Low-End Sales: The distribution has a minimum threshold close to SGD 300,000, which might indicate a baseline resale price in the market. This likely reflects minimum price points for 3-room or 4-room flats with fewer amenities or in less central locations.

Slight Right Skew: Although the majority of transactions are clustered in the middle range, the histogram has a slight right skew due to the long tail extending towards higher prices. This skew reflects that while most flats fall within a certain price range, a small percentage of flats sell at higher price points due to factors such as location, size, or recent upgrades.

Code
tm_shape(mpsz)+
  tm_polygons(col = "white") +
tm_shape(resale_new_4room) +  
  tm_dots(col = "RESALE_PRICE",
          alpha = 0.6,
          style="quantile")
Note

Geographic Distribution of Resale Prices: Higher Prices (Darker Colors): Higher-priced resale flats (740,000 to 1,408,000 SGD) are more concentrated in central regions and areas near the Central Business District (CBD) and popular residential districts. This suggests that location significantly influences resale value, especially proximity to city center and high-demand neighborhoods. Lower Prices (Lighter Colors): Lower-priced flats (355,000 to 530,000 SGD) are more prevalent in peripheral areas like the northern and western parts of Singapore. These regions may be further from the CBD, major transport hubs, or amenities that add value.

Concentration Patterns: Areas with a dense clustering of resale flats (e.g., in the central and northeastern regions) suggest neighborhoods with a high number of transactions. These areas are likely popular for buyers, possibly due to access to amenities such as MRT stations, malls, and schools. In contrast, sparser distributions in northern and eastern areas indicate fewer resale transactions or lower demand, potentially due to fewer amenities or more remote locations.

6.3 Exploring Structural Factors

Investigate structural factors such as floor area, age, remaining lease, and floor level to see how they correlate with resale price.

Code
# Scatter plot for Resale Price vs Floor Area
ggplot(resale_new_4room, aes(x = FLOOR_AREA_SQM, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  labs(title = "Resale Price vs Floor Area", x = "Floor Area (sqm)", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Age of Flat
ggplot(resale_new_4room, aes(x = UNIT_AGE, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  labs(title = "Resale Price vs Age of Flat", x = "Age of Flat (years)", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Age of Flat
ggplot(resale_new_4room, aes(x = FLOOR_LEVEL, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "darkred") +
  labs(title = "Resale Price vs Age of Flat", x = "Age of Flat (years)", y = "Resale Price") +
  theme_minimal()

Note

The analysis of resale prices reveals that both floor area and the age of flats are significant factors influencing their resale values. There is a clear positive relationship between floor area and resale price, with larger units generally commanding higher prices, indicating that size is an important determinant for buyers. Conversely, a negative trend is observed between the age of a flat and its resale price; older flats tend to have lower resale values compared to newer ones. This could be due to factors like shorter remaining leases and potentially outdated designs in older flats. Overall, the results suggest that HDB buyers prioritize larger and newer properties, likely for the added space, updated amenities, and longer lease periods.

6.4 Scatter Plots of Resale Price vs. Distance to Amenities

Code
# Scatter plot for Resale Price vs Distance to MRT
ggplot(resale_new_4room, aes(x = PROX_MRT, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "purple") +
  labs(title = "Resale Price vs Distance to MRT", 
       x = "Distance to MRT (meters)", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Distance to Bus Stop
ggplot(resale_new_4room, aes(x = PROX_BUS_STOP, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "green") +
  labs(title = "Resale Price vs Distance to Bus Stop", 
       x = "Distance to Bus Stop (meters)", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Distance to Supermarket
ggplot(resale_new_4room, aes(x = PROX_SUPERMARKET, y = RESALE_PRICE)) +
  geom_point(alpha = 0.5, color = "orange") +
  labs(title = "Resale Price vs Distance to Supermarket", 
       x = "Distance to Supermarket (meters)", y = "Resale Price") +
  theme_minimal()

Note

Resale Price vs. Distance to MRT: The plot shows a trend where resale prices generally decrease as the distance from MRT stations increases, especially within the first 500 meters. This suggests that closer proximity to MRT stations is valued by buyers, likely due to the convenience of public transportation.

Resale Price vs. Distance to Bus Stops: The relationship between resale price and distance to bus stops is less pronounced than that of MRT stations. There is a slight concentration of higher resale prices closer to bus stops, particularly within the first 100 meters, but the trend is not as strong. This may indicate that while proximity to bus stops adds some value, it is less influential than proximity to MRT stations.

Resale Price vs. Distance to Supermarkets: Similar to MRT stations, there is a tendency for resale prices to be higher when closer to supermarkets, though this trend is moderate. Convenience to nearby supermarkets appears to be an attractive factor for buyers, possibly reflecting the importance of accessibility to daily necessities.

### 6.5 Count of Amenities within Radius

Code
# Scatter plot for Resale Price vs Count of Nearby Clinics
ggplot(resale_new_4room, aes(x = count_clinics, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "blue", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Clinics", 
       x = "Count of Nearby Clinics", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Count of Nearby Hawker Centers
ggplot(resale_new_4room, aes(x = count_hawker_center, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "red", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Hawker Centers", 
       x = "Count of Nearby Hawker Centers", y = "Resale Price") +
  theme_minimal()

Code
# Scatter plot for Resale Price vs Count of Nearby Hawker Centers
ggplot(resale_new_4room, aes(x = count_schoolzone, y = RESALE_PRICE)) +
  geom_jitter(width = 0.2, color = "green", alpha = 0.5) +
  labs(title = "Resale Price vs Number of Nearby Schoolzone", 
       x = "Count of Nearby School zone", y = "Resale Price") +
  theme_minimal()

Note

Resale Price vs. Number of Nearby Clinics: The scatter plot shows a slight trend where having a higher number of nearby clinics is associated with somewhat higher resale prices, although the effect is modest. Properties located near a moderate number of clinics (5-10 clinics) tend to have a higher frequency of mid-to-high resale prices. This suggests that access to healthcare facilities is valued by some buyers but may not be a dominant factor in driving resale prices.

Resale Price vs. Number of Nearby Hawker Centers: There is a noticeable clustering of properties around the 0 to 1 nearby hawker centers, with relatively higher resale prices for those with 1-2 hawker centers nearby. This distribution implies that the presence of hawker centers, which offer affordable food options, may slightly enhance the resale price, likely due to the convenience and lifestyle benefits they provide.

Resale Price vs. Number of Nearby School Zones: The plot indicates that properties with 1 or 2 nearby school zones have a broader range of resale prices, with some reaching higher price points. This trend suggests that proximity to schools may be appealing to families, thereby adding value to properties in school-dense areas. However, beyond 2 school zones, the resale price effect appears minimal, likely because additional nearby schools do not further increase perceived convenience.

6.5Correlation Analysis

Code
resale_new_4room_nogeo <- resale_new_4room %>%
  st_drop_geometry()
ggcorrmat(resale_new_4room_nogeo, names(resale_new_4room_nogeo))

Note

Negative Correlation between UNIT_AGE and REMAINING_LEASE_YEARS: As expected, there is a strong negative correlation between UNIT_AGE and REMAINING_LEASE_YEARS, given that one is derived from the other. This relationship confirms that as a unit ages, its remaining lease decreases correspondingly. Since these two variables are highly collinear, it would be prudent to retain only one of them in further analysis to avoid multicollinearity issues.

Positive Correlations with RESALE_PRICE: FLOOR_AREA_SQM shows a moderate positive correlation with RESALE_PRICE (0.63). This suggests that larger flats tend to have higher resale prices, as expected. FLOOR_LEVEL also has a moderate positive correlation with RESALE_PRICE (0.45), indicating that higher-floor units are often more desirable and fetch higher prices in the resale market.

Minimal Correlations for Most Location-Based Variables: Proximity to amenities such as PROX_MRT, PROX_BUS_STOP, and PROX_SUPERMARKET shows low or non-significant correlations with RESALE_PRICE. This suggests that, for this dataset, proximity to these amenities might not have a strong direct influence on resale prices. The count of nearby amenities, like clinics (count_clinics), hawker centers (count_hawker_center), and school zones (count_schoolzone), also shows minimal correlation with RESALE_PRICE.

7 Building a non-spatial multiple linear regression

7.1 Prepare Data for Regression

Remove REMAINING_LEASE_YEARS and ensure all relevant variables are formatted correctly. You might also need to convert categorical variables, such as TOWN and FLAT_MODEL, into dummy variables or factors if they’re not already.

Code
# Remove the `REMAINING_LEASE_YEARS` column
resale_new_4room <- resale_new_4room %>% select(-REMAINING_LEASE_YEARS)

# Convert categorical variables to factors
resale_new_4room$TOWN <- as.factor(resale_new_4room$TOWN)
resale_new_4room$FLAT_MODEL <- as.factor(resale_new_4room$FLAT_MODEL)


# Set a random seed for reproducibility
set.seed(123)

# Create training (80%) and testing (20%) data splits
train_index <- createDataPartition(resale_new_4room$RESALE_PRICE, p = 0.8, list = FALSE)
train_data <- resale_new_4room[train_index, ]
test_data <- resale_new_4room[-train_index, ]

7.2 Building the Model on the Training Data

Code
# Build the model on the training set
train_mlr <- lm(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone, data = train_data)
ols_regress(train_mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.780       RMSE                    97612.564 
R-Squared                   0.608       MSE                9528212723.287 
Adj. R-Squared              0.603       Coef. Var                  15.117 
Pred R-Squared              0.597       AIC                     19177.027 
MAE                     76510.381       SBC                     19227.730 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                     
-----------------------------------------------------------------------------
                    Sum of                                                   
                   Squares         DF       Mean Square       F         Sig. 
-----------------------------------------------------------------------------
Regression    1.096911e+13          9       1.21879e+12     126.19    0.0000 
Residual      7.069934e+12        732    9658379563.769                      
Total         1.803904e+13        741                                        
-----------------------------------------------------------------------------

                                             Parameter Estimates                                              
-------------------------------------------------------------------------------------------------------------
              model          Beta    Std. Error    Std. Beta       t        Sig          lower         upper 
-------------------------------------------------------------------------------------------------------------
        (Intercept)    302289.645     55615.316                   5.435    0.000    193105.097    411474.192 
     FLOOR_AREA_SQM      3500.832       575.730        0.152      6.081    0.000      2370.552      4631.111 
           UNIT_AGE     -4443.227       291.178       -0.396    -15.259    0.000     -5014.870     -3871.584 
        FLOOR_LEVEL      9279.482       567.665        0.412     16.347    0.000      8165.038     10393.927 
           PROX_MRT       -72.858         9.811       -0.178     -7.426    0.000       -92.119       -53.596 
      PROX_BUS_STOP       195.219        67.556        0.068      2.890    0.004        62.593       327.845 
   PROX_SUPERMARKET        -2.982        25.658       -0.003     -0.116    0.908       -53.354        47.390 
      count_clinics      6459.330      1273.603        0.132      5.072    0.000      3958.979      8959.680 
count_hawker_center     48602.317      5341.484        0.235      9.099    0.000     38115.862     59088.772 
   count_schoolzone    -31888.927      5094.498       -0.151     -6.259    0.000    -41890.498    -21887.357 
-------------------------------------------------------------------------------------------------------------
Code
tbl_regression(train_mlr, intercept = TRUE)
Characteristic Beta 95% CI1 p-value
(Intercept) 302,290 193,105, 411,474 <0.001
FLOOR_AREA_SQM 3,501 2,371, 4,631 <0.001
UNIT_AGE -4,443 -5,015, -3,872 <0.001
FLOOR_LEVEL 9,279 8,165, 10,394 <0.001
PROX_MRT -73 -92, -54 <0.001
PROX_BUS_STOP 195 63, 328 0.004
PROX_SUPERMARKET -3.0 -53, 47 >0.9
count_clinics 6,459 3,959, 8,960 <0.001
count_hawker_center 48,602 38,116, 59,089 <0.001
count_schoolzone -31,889 -41,890, -21,887 <0.001
1 CI = Confidence Interval
Note

R-Squared (0.608): This indicates that approximately 60.8% of the variance in RESALE_PRICE is explained by the model. While this is a reasonable fit, there may still be room for improvement.

Significant Predictors: Most variables are statistically significant (p-value < 0.05), meaning they contribute meaningfully to the prediction of resale prices.

Positive Predictors: FLOOR_AREA_SQM, FLOOR_LEVEL, count_clinics, count_hawker_center, and PROX_BUS_STOP have positive coefficients, indicating that larger units, higher floors, and proximity to certain amenities like clinics and hawker centers are associated with higher resale prices.

Negative Predictors: UNIT_AGE, PROX_MRT, and count_schoolzone have negative coefficients, suggesting that older units, proximity to MRT stations, and nearby school zones are associated with lower resale prices.

Non-Significant Predictor: PROX_SUPERMARKET has a high p-value (0.908), suggesting it is not a significant predictor in this model.

Check for Multicollinearity

Code
ols_vif_tol(train_mlr)
            Variables Tolerance      VIF
1      FLOOR_AREA_SQM 0.8599615 1.162843
2            UNIT_AGE 0.7969074 1.254851
3         FLOOR_LEVEL 0.8432154 1.185937
4            PROX_MRT 0.9329164 1.071907
5       PROX_BUS_STOP 0.9648531 1.036427
6    PROX_SUPERMARKET 0.9265095 1.079320
7       count_clinics 0.7962761 1.255846
8 count_hawker_center 0.8044852 1.243031
9    count_schoolzone 0.9201214 1.086813
Note

The VIF (Variance Inflation Factor) values in this multicollinearity check are all well below the commonly accepted threshold of 5 (or even 10 in some cases). This means that multicollinearity is not a concern for this model, as each predictor variable shows minimal correlation with the others.

7.3 Adjustments to Improve the Model

Since PROX_SUPERMARKET has a very high p-value, removing it could simplify the model without losing predictive power.

Additionally, some predictors, such as FLOOR_AREA_SQM or UNIT_AGE, could be transformed if their relationships with RESALE_PRICE are non-linear (e.g., log transformation for positive, skewed predictors).

Code
# Re-fit the model without PROX_SUPERMARKET
train_data2 <- train_data %>%
    mutate(LOG_RESALE_PRICE = log(RESALE_PRICE))

train_mlr2 <- lm(LOG_RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + FLOOR_LEVEL + PROX_MRT + 
                PROX_BUS_STOP + count_clinics + count_hawker_center + count_schoolzone, 
                data = train_data2)
ols_regress(train_mlr2)
                         Model Summary                           
----------------------------------------------------------------
R                       0.786       RMSE                  0.134 
R-Squared               0.617       MSE                   0.018 
Adj. R-Squared          0.613       Coef. Var             1.008 
Pred R-Squared          0.608       AIC                -859.410 
MAE                     0.107       SBC                -813.316 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
               Sum of                                                
              Squares         DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression     21.403          8          2.675    147.665    0.0000 
Residual       13.280        733          0.018                      
Total          34.683        741                                     
---------------------------------------------------------------------

                                       Parameter Estimates                                        
-------------------------------------------------------------------------------------------------
              model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
-------------------------------------------------------------------------------------------------
        (Intercept)    12.830         0.076                 169.904    0.000    12.681    12.978 
     FLOOR_AREA_SQM     0.006         0.001        0.178      7.235    0.000     0.004     0.007 
           UNIT_AGE    -0.007         0.000       -0.447    -17.582    0.000    -0.008    -0.006 
        FLOOR_LEVEL     0.012         0.001        0.378     15.203    0.000     0.010     0.013 
           PROX_MRT     0.000         0.000       -0.189     -7.981    0.000     0.000     0.000 
      PROX_BUS_STOP     0.000         0.000        0.065      2.804    0.005     0.000     0.000 
      count_clinics     0.009         0.002        0.132      5.221    0.000     0.006     0.012 
count_hawker_center     0.066         0.007        0.228      8.970    0.000     0.051     0.080 
   count_schoolzone    -0.045         0.007       -0.155     -6.504    0.000    -0.059    -0.032 
-------------------------------------------------------------------------------------------------
Note

The re-fitted multiple linear regression model, using LOG_RESALE_PRICE as the dependent variable, appears to show improved performance compared to the previous model in a few key areas. R-Squared: The R-squared value is now 0.617, meaning that approximately 61.7% of the variance in log resale prices is explained by the model. This is a relatively strong level of explanatory power for a real estate model. Adjusted R-Squared: At 0.613, this value suggests that the model remains strong even after adjusting for the number of predictors. RMSE: The Root Mean Square Error (RMSE) of 0.134 (on the log scale) suggests a good fit with reduced residual variance. AIC: A lower AIC (-859.410) indicates that this model may be preferred over previous versions, as a lower AIC generally signifies a better model fit.

7.4 Test for Normality

Code
ols_plot_resid_hist(train_mlr2)

Note

Symmetry: The residuals are fairly symmetric around zero, though there is a slight skew to the right. Ideally, for the normality assumption to hold fully, we would expect the histogram to be more symmetric.

Approximate Normality: The shape is close to normal, with a high concentration of residuals near zero and fewer residuals at the tails. This indicates that the model does not systematically over- or under-predict across the range of data. While some deviation from normality is visible in the tails, this deviation is minor and not unusual in practical applications.

7.5 Test for Spatial Autocorrelation

Code
mlr_res <- as.data.frame(train_mlr2$residuals)
resale_res <- cbind(train_data2,
                    mlr_res) %>%
  rename(MLR_RES = train_mlr2.residuals)

tm_shape(mpsz)+
  tm_polygons(col = "white") +
tm_shape(resale_res) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_layout(main.title = "Multiple Linear Regression Residuals",     
            main.title.position = "center",
            main.title.size = 1)
Code
tmap_mode("plot")

The figure above reveal that there is sign of spatial autocorrelation.

To proof that our observation is indeed true, the Moran’s I test will be performed

Code
resale_sp <- as_Spatial(resale_res)

nb <- dnearneigh(coordinates(resale_sp), 0, 2000, longlat = FALSE)

nb_lw <- nb2listw(nb, style = 'W')

lm.morantest(train_mlr2, nb_lw)

    Global Moran I for regression residuals

data:  
model: lm(formula = LOG_RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE +
FLOOR_LEVEL + PROX_MRT + PROX_BUS_STOP + count_clinics +
count_hawker_center + count_schoolzone, data = train_data2)
weights: nb_lw

Moran I statistic standard deviate = 50.353, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    4.399105e-01    -3.350891e-03     7.749292e-05 
Note

Moran’s I Value: The observed Moran’s I value is 0.4399, indicating positive spatial autocorrelation in the residuals. This suggests that the residuals are spatially clustered—neighboring locations have similar residual values. This indicates that the model may not have fully accounted for spatial dependencies.

Significance Level: The p-value is < 2.2e-16, which is far below the standard threshold of 0.05. This means the result is highly statistically significant, and we can reject the null hypothesis that the residuals are randomly distributed. This further confirms the spatial clustering of the residuals.

8.GWR & GWRF Predictive Method

8.1 Converting to SpatialPointsDataFrame

Code
train_data_sp <- as_Spatial(train_data)

8.2 Computing Adaptive Bandwidth

To identify the optimal bandwidth, use cross-validation (CV) within the bw.gwr() function. This step helps in determining the appropriate number of neighbors for each local regression.

Code
bw_adaptive <- bw.gwr(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                      data=train_data_sp,
                      approach="CV",
                      kernel="gaussian",
                      adaptive=TRUE,
                      longlat=FALSE)
Adaptive bandwidth: 466 CV score: 6.46379e+12 
Adaptive bandwidth: 296 CV score: 5.894423e+12 
Adaptive bandwidth: 190 CV score: 5.159191e+12 
Adaptive bandwidth: 125 CV score: 4.260186e+12 
Adaptive bandwidth: 84 CV score: 3.483417e+12 
Adaptive bandwidth: 59 CV score: 2.624868e+12 
Adaptive bandwidth: 43 CV score: 2.08822e+12 
Adaptive bandwidth: 34 CV score: 1.906203e+12 
Adaptive bandwidth: 27 CV score: 1.772778e+12 
Adaptive bandwidth: 24 CV score: 1.761192e+12 
Adaptive bandwidth: 21 CV score: 1.702006e+12 
Adaptive bandwidth: 20 CV score: 1.690141e+12 
Adaptive bandwidth: 18 CV score: 1.642465e+12 
Adaptive bandwidth: 18 CV score: 1.642465e+12 
Code
# Save bandwidth result for reuse
write_rds(bw_adaptive, "data/rds/bw_adaptive.rds")

8.3 Constructing the GWR Model

Using the saved bandwidth, fit the GWR model with an adaptive Gaussian kernel.

Code
gwr_adaptive <- gwr.basic(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

# Save the GWR model for future use
write_rds(gwr_adaptive, "data/rds/gwr_adaptive.rds")

8.4 Retrieving GWR Output

Load the saved model and display the output to examine coefficient values and other diagnostics.

Code
gwr_adaptive <- read_rds("data/rds/gwr_adaptive.rds")
print(gwr_adaptive)
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-11-15 13:54:00.830474 
   Call:
   gwr.basic(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + 
    FLOOR_LEVEL + PROX_MRT + PROX_BUS_STOP + PROX_SUPERMARKET + 
    count_clinics + count_hawker_center + count_schoolzone, data = train_data_sp, 
    bw = bw_adaptive, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  RESALE_PRICE
   Independent variables:  FLOOR_AREA_SQM UNIT_AGE FLOOR_LEVEL PROX_MRT PROX_BUS_STOP PROX_SUPERMARKET count_clinics count_hawker_center count_schoolzone
   Number of data points: 742
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-227173  -64473  -17631   57356  406317 

   Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
   (Intercept)         302289.645  55615.316   5.435 7.45e-08 ***
   FLOOR_AREA_SQM        3500.832    575.730   6.081 1.93e-09 ***
   UNIT_AGE             -4443.227    291.178 -15.259  < 2e-16 ***
   FLOOR_LEVEL           9279.482    567.665  16.347  < 2e-16 ***
   PROX_MRT               -72.858      9.811  -7.426 3.12e-13 ***
   PROX_BUS_STOP          195.219     67.556   2.890  0.00397 ** 
   PROX_SUPERMARKET        -2.982     25.658  -0.116  0.90751    
   count_clinics         6459.330   1273.603   5.072 5.00e-07 ***
   count_hawker_center  48602.317   5341.484   9.099  < 2e-16 ***
   count_schoolzone    -31888.927   5094.498  -6.259 6.58e-10 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 98280 on 732 degrees of freedom
   Multiple R-squared: 0.6081
   Adjusted R-squared: 0.6033 
   F-statistic: 126.2 on 9 and 732 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 7.069934e+12
   Sigma(hat): 97744.38
   AIC:  19177.03
   AICc:  19177.39
   BIC:  18558.43
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 18 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                              Min.     1st Qu.      Median     3rd Qu.
   Intercept           -4.6538e+05  3.3412e+05  4.3010e+05  5.3961e+05
   FLOOR_AREA_SQM      -2.3396e+01  2.6054e+03  3.2933e+03  5.0034e+03
   UNIT_AGE            -1.0103e+04 -7.2183e+03 -4.9476e+03 -4.0259e+03
   FLOOR_LEVEL          3.6600e+02  3.5074e+03  5.0000e+03  6.1503e+03
   PROX_MRT            -2.9873e+02 -1.5011e+02 -9.6621e+01 -3.7140e+01
   PROX_BUS_STOP       -7.6902e+02 -5.6302e+01 -3.7344e+00  1.0510e+02
   PROX_SUPERMARKET    -3.1889e+02 -6.6561e+01 -4.8243e+00  4.1944e+01
   count_clinics       -1.4930e+04 -1.7624e+03  1.9888e+03  5.1396e+03
   count_hawker_center -5.7545e+04 -8.4636e+03  4.6069e+03  1.6099e+04
   count_schoolzone    -7.5350e+04 -1.2103e+04 -2.2751e+03  8.3975e+03
                             Max.
   Intercept           840576.850
   FLOOR_AREA_SQM       11576.366
   UNIT_AGE             -1055.267
   FLOOR_LEVEL          12466.447
   PROX_MRT                63.306
   PROX_BUS_STOP          539.328
   PROX_SUPERMARKET       194.674
   count_clinics        18448.559
   count_hawker_center 157597.607
   count_schoolzone     79652.802
   ************************Diagnostic information*************************
   Number of data points: 742 
   Effective number of parameters (2trace(S) - trace(S'S)): 271.5186 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 470.4814 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 18183.9 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 17782.3 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 18257.58 
   Residual sum of squares: 829729534739 
   R-square value:  0.9540037 
   Adjusted R-square value:  0.9274023 

   ***********************************************************************
   Program stops at: 2024-11-15 13:54:00.946882 

8.5 Preparing Coordinates Data

Extract the x and y coordinates from the sf objects, which will be useful for spatial modeling.

Code
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)

write_rds(coords_train, "data/rds/coords_train.rds")
write_rds(coords_test, "data/rds/coords_test.rds")

train_data <- train_data %>% st_drop_geometry()

8.6 Calibrating Random Forest Model

Using the ranger package, fit a Random Forest model on train_data.

Code
set.seed(1234)
rf <- ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
             data=train_data)

# Save the model
write_rds(rf, "data/rds/rf.rds")

8.7Calibrating Geographical Random Forest Model

Using the grf function in the SpatialML package

Code
set.seed(1234)
gwRF_adaptive <- grf(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE+ FLOOR_LEVEL + PROX_MRT +PROX_BUS_STOP+ PROX_SUPERMARKET + count_clinics + count_hawker_center + count_schoolzone,
                     dframe=train_data, 
                     bw=55,
                     kernel="adaptive",
                     coords=coords_train)
Ranger result

Call:
 ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + UNIT_AGE + FLOOR_LEVEL +      PROX_MRT + PROX_BUS_STOP + PROX_SUPERMARKET + count_clinics +      count_hawker_center + count_schoolzone, data = train_data,      num.trees = 500, mtry = 3, importance = "impurity", num.threads = NULL) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      742 
Number of independent variables:  9 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       6990351252 
R squared (OOB):                  0.7128534 
     FLOOR_AREA_SQM            UNIT_AGE         FLOOR_LEVEL            PROX_MRT 
       1.405120e+12        4.230223e+12        4.405082e+12        1.741839e+12 
      PROX_BUS_STOP    PROX_SUPERMARKET       count_clinics count_hawker_center 
       1.183246e+12        1.057512e+12        1.302310e+12        1.654157e+12 
   count_schoolzone 
       4.153670e+11 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-246290  -29308   -3210   -2081   23492  298915 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-60678.5  -5377.1   -402.8   -138.7   4941.4  81329.7 
                           Min          Max         Mean          StD
FLOOR_AREA_SQM      7188555240 4.909814e+11  73922810814  85391993951
UNIT_AGE            9268444838 1.029969e+12 252150186902 324210542482
FLOOR_LEVEL         6986411093 9.273832e+11  99070468430 142186160540
PROX_MRT            4089299733 4.688317e+11  77971884997 100395411795
PROX_BUS_STOP       4928756987 3.182668e+11  38149218960  48704790235
PROX_SUPERMARKET    5019768869 2.120083e+11  46488733804  42627504945
count_clinics       2323146882 3.780597e+11  52819903249  76273846773
count_hawker_center          0 1.602772e+11  14999348883  21455875752
count_schoolzone     779705563 3.588960e+11  18693503937  45134421838
Code
write_rds(gwRF_adaptive, "data/rds/gwRF_adaptive.rds")
Code
variable_importance <- gwRF_adaptive$Global.Model$variable.importance
sort(variable_importance, decreasing = TRUE)
        FLOOR_LEVEL            UNIT_AGE            PROX_MRT count_hawker_center 
       4.405082e+12        4.230223e+12        1.741839e+12        1.654157e+12 
     FLOOR_AREA_SQM       count_clinics       PROX_BUS_STOP    PROX_SUPERMARKET 
       1.405120e+12        1.302310e+12        1.183246e+12        1.057512e+12 
   count_schoolzone 
       4.153670e+11 
Note

Based on the output, the top five parameters by importance in the geographically weighted random forest model are: FLOOR_LEVEL

UNIT_AGE

PROX_MRT

FLOOR_AREA_SQM

count_hawker_center

9.Model Evaluation

For the Model Evaluation section, assess the performance of each of your three models: GWR (Geographically Weighted Regression), GWRF (Geographically Weighted Random Forest), and MLR (Multiple Linear Regression).

Model Evaluation Metrics: Root Mean Square Error (RMSE): Measures the average magnitude of error. A lower RMSE indicates better model accuracy. Mean Absolute Error (MAE): Shows the average absolute error, providing insight into the average deviation of predictions from actual values. R-squared (R²): Explains the proportion of variance in the dependent variable that is predictable from the independent variables. Adjusted R-squared: Adjusts R² for the number of predictors in the model, providing a more reliable metric when comparing models with different numbers of variables.

9.1 Prepare Test Data

Code
test_data2 <- test_data %>%
    mutate(LOG_RESALE_PRICE = log(RESALE_PRICE))

coords_test <- st_coordinates(test_data)

resale_test_nogeo <- cbind(test_data, coords_test) %>%
  st_drop_geometry()

9.2 Multiple Linear Regression

Code
test_data2$PREDICT_LOG_MLR <- predict(train_mlr2, newdata = test_data2)


test_data2$PREDICT_MLR <- exp(test_data2$PREDICT_LOG_MLR) 
test_data2$ACTUAL_RESALE_PRICE <- exp(test_data2$LOG_RESALE_PRICE)  


rmse_mlr <- rmse(test_data2$ACTUAL_RESALE_PRICE, test_data2$PREDICT_MLR)
mae_mlr <- mae(test_data2$ACTUAL_RESALE_PRICE, test_data2$PREDICT_MLR)


r_squared_mlr <- summary(train_mlr2)$r.squared
adj_r_squared_mlr <- summary(train_mlr2)$adj.r.squared


cat("MLR RMSE:", rmse_mlr, "MAE:", mae_mlr, "R_squared:", r_squared_mlr, "Adjusted R_squared:", adj_r_squared_mlr, "\n")
MLR RMSE: 92155.49 MAE: 66285.86 R_squared: 0.617096 Adjusted R_squared: 0.612917 

9.3 Geographically Weighted Random Forest (GWRF)

Code
gwRF_pred <- predict.grf(
  gwRF_adaptive,               # the fitted GWRF model
  resale_test_nogeo,           # test data without geometry
  x.var.name = "X",            # column name for X coordinate
  y.var.name = "Y",            # column name for Y coordinate
  local.w = 1,                 # local weight
  global.w = 0                 # global weight
)

GRF_pred_df <- data.frame(GRF_pred = gwRF_pred)
rmse_gwrf <- rmse(resale_test_nogeo$RESALE_PRICE, GRF_pred_df$GRF_pred)
mae_gwrf <- mae(resale_test_nogeo$RESALE_PRICE, GRF_pred_df$GRF_pred)
ss_res <- sum((resale_test_nogeo$RESALE_PRICE - GRF_pred_df$GRF_pred)^2)
ss_tot <- sum((resale_test_nogeo$RESALE_PRICE - mean(resale_test_nogeo$RESALE_PRICE))^2)
r_squared_gwrf <- 1 - (ss_res / ss_tot)
n <- nrow(resale_test_nogeo)  # number of observations
p <- length(gwRF_adaptive$Global.Model$variable.importance)  # number of predictors
adj_r_squared_gwrf <- 1 - ((1 - r_squared_gwrf) * (n - 1) / (n - p - 1))

cat("GWRF RMSE:", rmse_gwrf, "MAE:", mae_gwrf, "R_squared:", r_squared_gwrf, "Adjusted R_squared:", adj_r_squared_gwrf, "\n")
GWRF RMSE: 58630.75 MAE: 39470.21 R_squared: 0.847518 Adjusted R_squared: 0.839631 

9.5 Evaluation_summary

Code
evaluation_summary <- data.frame(
  Model = c("MLR" , "GWRF"),
  RMSE = c(rmse_mlr, rmse_gwrf),
  MAE = c(mae_mlr, mae_gwrf),
  R_Squared = c(r_squared_mlr, r_squared_gwrf),
  Adjusted_R_Squared = c(adj_r_squared_mlr, adj_r_squared_gwrf)
)

print(evaluation_summary)
  Model     RMSE      MAE R_Squared Adjusted_R_Squared
1   MLR 92155.49 66285.86  0.617096           0.612917
2  GWRF 58630.75 39470.21  0.847518           0.839631

A lower RMSE indicates that the GWRF model has less prediction error compared to MLR. This suggests that GWRF is more accurate in predicting the resale price.

Similar to RMSE, a lower MAE in GWRF shows that on average, its predictions are closer to the actual values than those of the MLR model.

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. An R-squared of 0.8475 for GWRF suggests it explains about 85% of the variance in resale prices, whereas MLR explains only about 62%.

Adjusted R-squared accounts for the number of predictors in the model, giving a more accurate measure of the model’s explanatory power. The higher Adjusted R-squared for GWRF indicates that it provides a significantly better fit than MLR, even after accounting for model complexity.

10.Conclusion

The GWRF model shows superior performance in predicting resale prices, with lower RMSE and MAE values, and higher R-squared and Adjusted R-squared values. This suggests that GWRF captures more spatial variation and complex relationships in the data compared to the MLR model. The spatial nature of GWRF likely accounts for the improvement, as it can model local variations more effectively than MLR, which assumes a global relationship.